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1. INTRODUCTION 

The 45th Weather Squadron (45 WS) 
provides weather support to Kennedy Space 
Center (KSC) and Cape Canaveral Air Force 
Station (CCAFS) (Harms et al. 1999). Part of 
those weather services include a probability of 
lightning occurrence in the daily 24-Hour and 
Weekly Planning forecasts, which are briefed in 
the morning at 1100 UTC (0700 EDT). The 
probability of lightning occurrence is used in 
planning for daily ground operation activities on 
KSC and CCAFS, as the first step leading to local 
lightning watches and warnings (Weems et al. 
2001), and as part of forecasting the lightning 
launch commit criteria (Roeder and McNamara 
2006). 

Until completion of the Phase II work 
described herein, the lightning probability forecast 
was based on a subjective analysis of model and 
observational data and the output from an 
objective lightning forecast tool developed by the 
Applied Meteorology Unit (AMU) (Bauman et al. 
2004) in Phase I (Lambert and Wheeler 2005). 
This tool was a set of five equations, one for each 
of the warm season months May-September, that 
provided a probability of lightning occurrence for 
the day on KSC/CCAFS. The forecasters 
accessed the equations through a Microsoft® 
Excel© graphical user interface (GUI) by entering 
predictor values. These equations exhibited a 
large improvement in performance over other 
standard forecast methods in use and were 
transitioned to operations in time for the 2005 
warm season. 

Since the Phase I equations were developed, 
new ideas regarding certain predictors were 
formulated and a desire to make the tool more 
automated was expressed by 45 WS forecasters. 
They anticipated that modifying the predictors 
would improve tool performance, and automating 


the tool would reduce the time spent producing 
the forecasts and decrease the likelihood of 
human error in entering values. Phase II, 
therefore, had two parts: 1) to re-examine and 
modify the calculation method of certain predictors 
and to use the modified predictors in developing 
new monthly equations, and 2) to create an 
automated tool in the current operational weather 
display system for the 45 WS, the Meteorological 
Interactive Data Display System (MIDDS). 

2. DATA 

The Period of Record (POR) for the data 
increased from 15 to 17 years and included warm 
season data in the years 1989 - 2005. The data 
sources were the 

• Cloud-to-Ground Lightning Surveillance 
System (CGLSS), 

• 1200 UTC Jacksonville (JAX), Tampa 
(TBW), and Miami (MFL) soundings, and 

• 1 000 UTC CCAFS (XMR) sounding. 

2.1 Cloud-to-Ground Ughtnjnci Surveillance 

System (CGLSS} 

The CGLSS is a network of six sensors 
(Figure 1) that collects date, time, latitude, 
longitude, strength, and polarity information of 
cloud-to-ground (CG) lightning strikes in the local 
KSC/CCAFS area (Boyd et al. 2005). The CGLSS 
data were used to determine whether or not 
lightning occurred on each day in the POR. The 
purpose of the CGLSS data was to create the 
binary predictand for the equations. The data 
were also used to create a daily climatological 
frequency and persistence forecasts that would be 
used as candidate predictors and forecast 
benchmarks against which to test the new 
equations. 
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indicated by the blue dots. The location names 
are next to the circles. The Duda sensor (red 
circle) was moved to the Deseret site in 2005. 

2.2 Florida 1200 UTC Rawinsondes 

These data were collected to determine the 
daily flow regimes using the procedure outlined in 
Lericos et al. (2002), which based the flow regime 
on the mean wind direction in the 1000-700 mb 
layer. As noted in Lericos, the current MFL and 
JAX sites were located at West Palm Beach, FL 
(PBI) and Waycross, GA (AYS), respectively, prior 
to 1995. The PBI and AYS data were used as 
proxies for MFL and JAX, respectively, during the 
period 1989-1994. The map in Figure 2 shows 
the locations of all the soundings used in this task. 

Use of the 1200 UTC sounding may seem 
inappropriate as it cannot provide data in time for 
the 1100 UTC briefing. Use of the 0000 UTC 
sounding from the day before was ruled out as the 
1000-700 mb flow during the Florida warm 
season could be contaminated by afternoon 
convective circulations that mask the larger scale 
flow pattern. For the purpose of determining the 
flow regimes for each day in the POR, the 1200 
UTC sounding provided the most reliable data. 


Due to the weak synoptic patterns during the 
Florida warm season, it is not likely that a flow 
regime change would take place in the two-hour 
period between 1000-1200 UTC. In an 
operational setting, the 45 WS can use several 
data sources, including model output and surface 
observations, to help determine the flow regime of 
the day before the morning 1100 UTC briefing. 

2.3 XMR 1000 UTC Sounding 

The 45 WS forecasters use data from the 
XMR 1000 UTC sounding for the 1100 UTC 
morning briefing since it contains the most recent 
information on the state of the atmosphere over 
the area. These data were used to calculate the 
sounding parameters normally available to the 
forecasters through MIDDS. The parameters were 
used as candidate predictors in the equation 
development. The XMR soundings were also 
used to determine the flow regime of the day with 
the 1200 UTC JAX, TBW, and MFL soundings. 
The procedure is discussed in Section 3.3. 



Figure 2. The red dots on the map show the 
locations of all soundings used in this work. 
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3. MODIFICATIONS 

As stated in Section 1 , the 45 WS requested 
modifications to the Phase I data and candidate 
predictors that could improve their performance. 
Specifically, they requested five modifications: 

• Increase the POR by two years, 

• Modify the valid area, 

• Use the XMR 1000 UTC sounding to help 
determine the flow regime of the day, 

• Use different smoothing values for the 
daily climatology, and 

• Determine an optimal layer for the 
average RH calculation. 

3.1 Increased POR 

Two more warm seasons occurred since the 
Phase I equations were developed. The new POR 
now includes data from all the warm seasons in 
the years 1989-2005. This increased the POR 
from 1 5 to 17 years and was expected to produce 
more robust statistics in the development of the 
equations. 

3.2 New Valid Area 

The equations are meant to forecast lightning 
occurrence within 5 n mi-radius warning circles 
surrounding specific asset locations (Figure 3). 
The valid area for CG lightning occurrence in 
Phase I was the entire area shown in Figure 3, a 
rectangle surrounding all 5 n mi warning circles 
including Astrotech. For Phase II, the 45 WS 
requested that the valid area be reduced to 
include only the 10 circles on KSC (blue circles) 
and CCAFS (red circles) to the right of the vertical 
black line in Figure 3. While the 45 WS has a 
warning responsibility for Astrotech, that area is 
not included in their daily planning forecasts. 

The AMU used an algorithm that calculated 
the distance of each CG from the center of each 
of the 10 circles. The latitude/longitude (lat/lon) 
values from the CGLSS data and the center 
lat/lon of all 10 circles were used in the Great 
Circle Distance Formula found at 
http://www.meridianworlddata.com/distance- 
calculation.asp : 

D = 3437.75 * arccos[sin(lat1 ) * sin(lat2) + 
cos(latl) * cos(lat2) * cos(lon2-lon1)], 


where D is the distance between the strike and 
the circle in n mi, latl/lonl is the lat/lon of the 
circle center, lat2/lon2 is the lat/lon of the strike, 
and all lat/lon values are in radians. A day on 
which CG lightning were within any circle (D <5 n 
mi) was considered a lightning day. The number 
of strikes was not considered for the lightning 
probabilities. 



Figure 3. The 5 n mi lightning warning circles on 
KSC/CCAFS and Astrotech. The valid area for the 
Phase II work is within the four blue (KSC) and six 
red (CCAFS) circles with centers to the right of the 
vertical black line. 

3.3 How Regime Discriminator 

Table 1 shows the definitions for the flow 
regimes used in both Phases. After stratifying the 
days by flow regime in Phase 1 , the AMU found 
that 44% of the days could not be categorized into 
any of the defined regimes. Given that lightning 
occurred on 45% of those days, they could not be 
neglected. Therefore, the AMU stratified them into 
a new flow regime category named ‘Other’. The 
45 WS suggested using the 1000-700 mb mean 
wind direction in the 1000 UTC XMR sounding to 
determine a flow regime for the ‘Other’ days in the 
Phase II work to reduce the number of days in 
‘Other’ and increase the number of days in the 
defined categories so that more robust statistics 
could be calculated. 
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Table 1 . List of the flow regime names used in Phases 1 and II and the corresponding sectors 
showing the average 1000-700 mb wind directions at each of the stations. 

Flow Regime Name and Description 

Rawinsonde Station 
MFL TBW JAX 

SW-1 Subtropical ridge south of MFL 

Southwest flow over KSC/CCAFS 

180°-270° 

180°-270° 

180°-270° 

SW-2 Subtropical ridge north of MFL, south of 
TBW Southwest flow over KSC/CCAFS 

90°-180° 

180°-270° 

180°-270° 

SE-1 Subtropical ridge north of TBW, south of 
JAX Southeast flow over KSC/CCAFS 

90°-180° 

90°-180° 

180°-270° 

SE-2 Subtropical ridge north of JAX 

Southeast flow over KSC/CCAFS 

90°-180° 

90°-180° 

90°-180° 

NW Northwest flow over Florida 

270°-360° 

270°-360° 

270°-360° 

NE Northeast flow over Florida 

0°-90° 

0°-90° 

0°-90° 

Other When the layer-averaged wind directions at 
the three stations did not fit in defined flow 
regime 




Missing One or more soundings missing 





The synoptic flow regime was determined first 
by using a combination of the average 
1000-700 mb wind directions from the 1200 UTC 
MFL, TBW, and JAX soundings, as outlined in 
Lericos et al. (2002). The wind speeds and 
directions were decomposed into u- and v- 
components, and then the layer-average u-and v- 
winds were calculated and recombined to get an 
average wind speed and direction at each 
individual location. Then, the average 1000-700 
mb wind direction in the 1000 UTC XMR sounding 
was calculated and used to determine the ‘local’ 
flow regime of the day. This was the discriminator 
in determining the final flow regime of the day. 
This procedure reduced the number of ‘Other’ 
cases by ~70% and distributed those cases 
among the SW, SE, NW, and NE regimes. 

3.4 Smoother for Daily CHmatglog^ 

The daily climatology values were important 
predictors in all five Phase I equations, and were 
also used as forecast benchmarks when testing 
the performance of the equations. 

The number of years that each day 
experienced lightning was determined, and a raw 
climatology was calculated by dividing this 
number by 17, the number of years in the POR. 


This yielded a fractional value between 0 and 1 for 
each day. The thin blue jagged curve in Figure 4a 
is the raw 17-year climatology for each day in the 
warm season. The noisy appearance of this curve 
is likely due to the few number of years in the 
POR; 17 is a small number of observations from 
which to calculate a climatology. A common 
procedure to minimize the noisiness of such a 
curve is to use a weighted average of the 
observations several days before and after the 
day of interest, artificially increasing the number of 
observations used in order to smooth out the 
curve and infer what the long-term climatology 
would be if enough observations were available. A 
Gaussian weighting function was used to smooth 
the curve, defined by the equation 



XI [w(F n _ k + F n+k) ]+ F n 

k=0 

E[w*2]+1 

k=0 


(Everitt 1 999), 


where W is the Gaussian weighting function 


W = exp 



(Wilks 2006), 
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P = climatological probability on the day of 
interest, 

N = number of years in the POR (17), 
n = day number of interest, 
k = number of days distant from n, 
m = maximum ± number of days distant from n, 

F = raw probability on day of interest, and 
a = scale factor in units of days. 

Using the Phase I values m = ±7 and a = 3 
days for the 17-year POR resulted in the 
smoother red curve in Figure 4a that was still 
somewhat noisy. The 45 WS tested several 
combinations and suggested m = +14 and a = 7 
days to provide the best combination of 
smoothness with minimal changing of the raw 
data. This created a smooth curve, represented 
by the thick blue curve in Figure 4a. The values 
for W are in Figure 4b. 

3.5 Optimal Relative Humidity Layer 

The average RH in the 800-600 mb layer was 
an important predictor in four of the five Phase I 
equations. This parameter was determined as 
valuable in the study that created the Neumann- 
Pfeffer Thunderstorm Index (NPTI) (Neumann 
1971) over 30 years ago. It has been used in 
several studies since that time, but no rigorous 
attempts have been made to determine if 800- 
600 mb is the optimal layer for this parameter. 

An iterative technique was used to determine 
the optimal layer using the 1000 UTC XMR 


sounding. It began by calculating the average RH 
in all 200-mb layers between 950 mb as the 
lowest base and 400 mb as the highest top, 
incrementing the base and top of each layer by 25 
mb. This resulted in 15 layers. The average RH in 
each of the layers was calculated with a log(p)- 
weighted averaging method using all levels in the 
layer. The region of influence, Dj, for each RH 
observation in the sounding was defined as 

D _ [log(Pi-i)-log(pi+i)] . 

1 2 


where Y is the level number, p M is the pressure at 
the observation directly below and p i+1 is the 
pressure directly above p,. The average RH for 
each 200-mb layer was calculated with the 
equation 


RH 


avg 


Z(RH i *D i ) 

ID; 


The 200-mb layer average RH with the highest 
correlation to lightning occurrence was 775-575 
mb, with a center at 675 mb. 


The iterative technique began anew, centered 
at 675 mb and adding layers in 25 mb increments 
above and below to find an optimal thickness. 
This procedure yielded the 825-525 mb layer 
average RH as the most highly correlated to 
lightning occurrence. This is close to the original 
200-mb thick layer of 800-600 mb. The new layer 
is 300-mb thick. 



(b) 


Gaussian Weight Values 



Number of Days from Center Day 


Figure 4. (a) The daily raw (thin blue curve), ±7-day smoothed (red curve), and ±14-day smoothed 
(thick blue curve) climatological probability values of lightning occurrence for the warm-season months 
in 1989-2005, and (b) The Gaussian weight values (W) used in the ±14-day smoothing equation. 
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4. EQUATION ELEMENTS 

The datasets described in Section 2 were 
processed with the modifications described in 
Section 3 to create the elements needed for the 
statistical forecast equation development. The 
necessary elements include a predictand and 
candidate predictors. The predictand is the 
element to be predicted from a predictor or group 
of predictors. 

4.1 Binary Predictand 

The CGLSS data were filtered spatially to 
include only strikes that occurred within the 10 5-n 
mi warning circles shown in Figure 3. Then they 
were filtered temporally to include only lightning 
strikes recorded in the time period 0700-0000 
EDT. The 45 WS morning forecast is issued at 
0700 EDT and is valid for 24 hours. However, the 
45 WS verification procedure is for the current 
day, or Day 1, to end at midnight (0000 EDT). 
They consider times after midnight as Day 2. 
Since the goal of this task was to develop 
equations for Day 1 forecasts, lightning occurring 
between midnight and 0700 EDT were not 
considered. 

Once the data were filtered, the value of the 
binary predictand was set to ‘1’ if lightning was 
detected within the defined time period and area 
on each day, otherwise a ‘0’ was assigned. A 
binary predictand was used because the 
prediction is for lightning occurrence, not the 
number of strikes. The 45 WS verification 
procedure only requires one strike for a lightning 
warning to be validated. 

4.2 Candidate Predictors 

The list of candidate predictors included 1-day 
persistence and daily climatological lightning 
frequency (Figure 4) determined from the CGLSS 
binary predictand, the flow regime probabilities 
from the soundings and CGLSS binary 
predictand, and 1 1 stability parameters calculated 
from the XMR sounding. 

CGLSS Predictors 

The binary predictand discussed in Section 
4.1 was used to create two candidate predictors: a 
binary 1-day persistence and the daily 
climatological probability of lightning occurrence 
described in Section 3.4. Calculation of the 


persistence predictor was straightforward. If 
lightning occurred on a particular day, the 
persistence value for the next day was T. If 
lightning did not occur, the persistence value was 
‘O’. The lightning occurrence information for April 
30 was used to create the persistence value for 
May 1. The values along the thick blue curve in 
Figure 4a were used for the daily climatological 
values of lightning probability. 

Flow Regime Probabilities 

The AMU determined a flow regime for each 
day using the morning JAX, TBW, MFL, and XMR 
soundings as described in Section 3.3. Then, the 
probabilities of lightning occurrence based on flow 
regime for each month and the entire warm 
season were calculated using the CGLSS binary 
predictand. The number of days that each regime 
occurred was compared to the CGLSS predictand 
to see how many of those days experienced 
lightning. The climatological probability was 
calculated by dividing the number of lightning 
days within a particular regime by the total 
number of days the regime occurred. 

The lightning probability values are shown in 
Table 2. The values for each flow regime in each 
month (columns 2-7) were used as candidate 
predictors in the equation development. The 
overall monthly climatologies (column 8) were 
used as forecast benchmarks in determining the 
skill of the equations. 

The SW-1 and SW-2 values for each month 
were within 10% of each other. Therefore, the 
SW-1 and SW-2 days in each month were 
combined to increase the sample size and 
produce a more reliable probability value. The 
resulting combined SW-1/2 values for June, July, 
and August were also within 10% of each other, 
therefore the days for these flow regimes and 
months were combined to create one SW value 
for those three months. In June-August, the SE-1 
and SE-2 regimes were also within 10%, so their 
values were combined to create one SE flow 
regime value for all three months. This was not 
the case for the May and September SE regimes. 
The parentheses around the SE-2 values for 
June-August indicate that it is a combined value 
and the same as SE-1 . 


6 



Table 2. Monthly probabilities of lightning occurrence based on the flow regimes 
that were used as candidate predictors. The values in the far-right column are the 
monthly probabilities for all flow regimes combined, and were used as a forecast 
benchmark. 


Month 

SW- 

1/2 

SE-1 

SE-2 

NW 

NE 

Other 

Monthly 

May 

30 

19 

6 

16 

2 

16 

18 

June 

68 

32 

(32) 

46 

11 

30 

46 

July 

68 

32 

(32) 

53 

14 

43 

48 

August 

68 

32 

(32) 

38 

12 

55 

49 

September 

55 

42 

29 

16 

17 

24 

33 


Stability Indices 

The stability indices calculated from the 1000 
UTC XMR sounding were those normally 
available to the forecasters through MIDDS. In 
order to calculate the same values that would be 
available to the forecasters, the same equations 
used in the MIDDS code were used. The stability 
index candidate predictors included 

• Total Totals (TT), 

• Cross Totals (CT), 

• Vertical Totals (VT), 

• K-Index (Kl), 

• Lifted Index (LI), 

• Thompson Index (Tl), 

• Severe Weather ThrEAT Index 
(SWEAT), 

• Showalter Stability Index (SSI), 

• Temperature at 500 mb (T500), 

• Mean Relative Humidity in the 825- 
525 mb layer (MRH), and 

• Precipitable water up to 500 mb (PW), 

Previous studies and local experience have 
shown the Kl and LI to be useful in predicting the 
likelihood of thunderstorms at KSC/CCAFS (Kelly 
et al. 1998, Howell 1998, and Everitt 1999). The 
Tl should also be a good predictor since it is the 
difference between Kl and LI. The other stability 
indexes were included as candidate predictors to 
provide an unbiased and complete selection. 

The MIDDS uses the Man-computer 
Interactive Data Access System (MclDAS) 
software (Lazzara et al. 1999) for processing 


sounding data. The formulas in the MclDAS code 
used for the indices are standard and can be 
found in several sources (e.g. Peppier and Lamb 
1989; Ohio State University Severe Weather 
Products web page at http://twister.sbs.ohio- 
state.eduL 

5. EQUATION DEVELOPMENT AND TESTING 

The AMU developed a set of five equations, 
one for each month in the warm season. The 
performance of the equations was assessed using 
several verification techniques appropriate for 
probability forecasts. 

5.1 Development and. Verification Datasets 

Of the 17 warm seasons in the POR, 14 were 
used for equation development and 3 were set 
aside for equation verification. This ensured that 
each month in the warm season was equally 
represented in both datasets. 

The stratification did not involve choosing 
individual warm season years for each dataset, 
but rather individual warm season days were 
chosen randomly. Days for the verification dataset 
were chosen first. Given that there are 153 days 
in the warm season, the random number 
generator in Excel was used to create three sets 
of 153 numbers representing the years 1989 - 
2005. The resulting three sets of years were 
assigned to each day in the warm season. Thus, 
each day in the warm season was represented by 
days from three random years. Care was taken to 
ensure the there were no duplicate years for each 
day from the random number generator. As an 
example, the verification dataset contains 1 May 
1989/1999/2001, 
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2 May 1993/1998/2000, etc. All other dates were 
made part of the development dataset. This 
random-day method was chosen to reduce the 
likelihood that any unusual convective seasons 
would bias the results. 

5.2 Eguatjon Development 

Final predictor selection was conducted for 
each individual month due to the possibility that 
different variables may become more critical to 
convection formation as the warm season 
progresses. 

Logistic Regression 

According to Wilks (2006), logistic regression 
is the appropriate method when the predictand is 
binary. Logistic regression was chosen as the 
statistical method in Phase I due to the binary 
nature of the predictand and also due to results 
from a previous study. Everitt (1999) showed that 
logistic regression yielded 48% better skill over 
the linear regression equations in NPTI when 
using the same predictor variables and data. The 
gain in skill was due solely to use of the logistic 
regression method. Given a predictand, y, and a 
set of predictors x,..x k , where k is the total number 
of predictors, logistic regression is represented by 

e (b 0 +b,x,+...+b k x k ) 

^ “ j + e (bo+b,Xi+...+b k Xk) ’ 

where b n . . ,b k are the coefficients for the 
corresponding predictors. 

Residual Deviance Calculation 

The contribution of each candidate predictor 
to the reduction in variance was determined by 
the residual deviance. The parameter serves the 
same role in logistic regression as does the 
residual sum of squares in a linear regression 
(Insightful Corporation 2005a). Menard (2000) 
examined several methods that help determine 
the amount of predictand variance explained by 
predictors in logistic regression equations. The 
preferred method in that study was determining 
the percentage drop in the residual deviance 
when a new predictor was added. Therefore, it 
was the method employed in Phases I and II. 

Predictor Selection 

The procedure to develop a logistic regression 
equation outlined in the S-PLUS User’s Manual 
(Insightful Corporation 2005b) was used to create 


the equations. The candidate predictors were 
added to a logistic regression equation one-by- 
one and their contribution to the reduction in 
residual deviance noted. While more automatic 
predictor selection methods in S-PLUS could have 
been employed, this manual process allowed for 
more control over understanding how each 
candidate predictor contributed to the reduction in 
residual deviance individually and in combination 
with other predictors. It was also facilitated by the 
relatively small number of candidate predictors 
available for selection. 

The single candidate predictor that affected 
the largest reduction in the residual deviance was 
chosen as the first predictor in the equation. The 
second candidate predictor that reduced the 
residual deviance by the largest amount in 
combination with the first was chosen as the 
second predictor. This process was followed using 
all candidate predictors. Figure 5 shows the 
percent reduction in residual deviance from the 
NULL model from the first eight predictors added 
for June as an example. The Tl reduced the 
residual deviance by the most (19%) and was, 
therefore, the first predictor. The second predictor 
was the flow regime lightning probability (FRProb 
in Figure 5), which accounted for an additional 9% 
reduction in residual deviance. The third predictor 
was persistence (Pers in Figure 5), reducing the 
residual deviance by 1 %, and so on. 

While all candidate predictors were used in 
each equation to determine their rank order, all 
could not be used in the final equations as that 
could lead to overfitting (Wilks 2006). When this 
happens, the equations perform well with data 
from the development dataset, but perform poorly 
on the verification set. Wilks (2006) suggests 
several ‘stopping rules’, the point at which no 
more predictors will be added to the equation. In 
Phase II, the AMU used a similar stopping rule to 
that in Phase I. Scree diagrams showing the 
reduction in residual deviance, like Figure 5, were 
created for each month (not shown). The AMU 
then examined the change in slope of the curves 
as each predictor was added. The change in the 
slope of the curve after MRH in Figure 5 is at the 
point where the residual deviance is reduced by 
less than 0.5%. Similar changes in slope were 
found in the other four months for the same cutoff 
value. The stopping rule was to stop adding 
predictors when a candidate predictor effected a 
< 0.5% change in the residual deviance that came 
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after it. The AMU tested equations with one more 
and one less predictor than the stopping rule 
indicated, only to conclude that the equation 
created with the < 0.5% stopping rule had superior 
performance. 



Figure 5. The total percent reduction in residual 
deviance from that of the NULL model as each 
predictor was added to the equation using the 
June development dataset. 

Final Predictors 

This predictor selection process was repeated 
for all five months in the warm season. Table 3 
shows the final predictors for each of the monthly 
equations in rank order of their reduction in 
residual deviance. The predictor names are color- 
coded according to the number of equations in 


which they appear. Red indicates that a predictor 
was chosen in every equation, i.e. it proved useful 
throughout the entire warm season. There was 
only one: the probability of lightning occurrence 
based on the flow regime. It was also ranked 
second in every month, underscoring its 
importance as a predictor in the KSC/CCAFS 
area. Blue identifies the two predictors, VT and 
persistence, that were chosen in four of the five 
equations. The July equation did not use VT and 
the August equation did not use Pers. The July 
equation did include VT indirectly, given that it is 
in the equation that calculates TT. The predictors 
in green were chosen for three of the equations, 
and they are the daily climatology, Tl, and MRH. 
These are followed by the predictors in black, 
which were only used in one equation each: Kl 
and TT. The June, July, and August equations did 
include Kl indirectly in the equation for Tl. 

The most important predictors in the May 
through August equations, Kl or Tl, account for 
instability and moisture in the profile, which are 
both necessary ingredients for thunderstorm 
formation. September has MRH as the most 
important predictor, which only accounts for the 
mid-level moisture. The fourth predictor, VT, 
accounts for mid-level instability, but it has a much 
smaller influence on the probability in September 
due to its rank. The flow regime probability as the 
second predictor accounts for the lifting 
mechanism, or lack thereof, as the low-level flow 
interacts with the sea breeze that occurs almost 
daily in the warm season. 


Table 3. The final predictors for each monthly equation, in rank order of their reduction in residual 
deviance. The predictors in red were in every equation, the predictors in blue were in four of the five 
equations, the predictors in green were in three of the five equations, and the predictors in black were 
in only one equation. 

May 

June 

July 

August 

September 

K-Index 
Flow Regime 
Vertical Totals 
Daily Climatology 
Persistence 

Thompson Index 
Flow Regime 
Persistence 
Vertical Totals 
825-525 mb 
MRH 

Thompson Index 
Flow Regime 
Total Totals 
Persistence 

Thompson Index 
Flow Regime 
Daily Climatology 
825-525 mb MRH 
Vertical Totals 

825-525 mb MRH 
Flow Regime 
Persistence 
Vertical Totals 
Daily Climatology 
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5.3 Eguatjon Perfomance 

The predictors from the verification dataset 
were used in the equations to produce ‘forecast’ 
probabilities. The forecast probabilities were 
compared with the binary lightning observations in 
the verification dataset using four tests that 
measured different aspects of forecast 
performance. They were the 

• Brier Skill Score, which is a measure of 
equation performance versus other 
standard forecast methods, 

• Distributions of the probability forecasts 
for days with and without lightning, 

• Reliability of the observed lightning 
frequency as a function of the forecast 
probability, and 

• Categorical contingency table statistics. 

Brier Skill Score 

This test determined if the P-2 equations 
showed improvement in skill over five forecast 
benchmarks: 

• Persistence, 

• Daily climatology (Figure 4a), 

• Flow regime probabilities (Table 2), 

• Monthly climatology (Table 2), and 

• P-1 equations. 

The mean square error (MSE) between the 
probability forecasts and observations was 
calculated using the equation 


MSE = - y ( Pi - Oj ) 2 (Wilks 2006), 

n j = i 


where n is the number of forecast/observation 
pairs, Pi is the probability associated with the 
forecast method, and Oj is the corresponding 
binary lightning observation. The skill of the P-2 
equations relative to the five forecast benchmarks 
was calculated using the equation for the Brier 
Skill Score (SS): 


SS = 


MSE eqn ~MSE ref 
v MSE perfect - MSE ref 


A 

* 

) 


100 (Wilks 2006), (10) 


where MSE eqn was the MSE of the P-2 equations, 
MSE re f was the forecast benchmark against which 
the new equations were tested, and MSEperfect was 
the MSE of a perfect forecast, which is always 0. 
The SS represents a percent improvement or 
degradation in skill of the equation over the 
reference forecast when it is positive or negative, 
respectively. 


The SS values for each of the monthly P-2 
equations and a composite result for the full warm 
season are shown in Table 4. The P-2 equations 
show a double-digit improvement in skill for the 
first four benchmarks in the table, similar to the 
results for the P-1 equations (Lambert and 
Wheeler 2005). The P-2 equations also show an 
8% improvement in skill over the P-1 equations for 
the composite warm season. For the individual 
months, the P-2 equations show an improvement 
in skill over the P-1 equations for June, July, and 
September. The values of 0.2% for May and - 
0.8% for August are very small and indicate 
similar skill between the two equation sets. 


Table 4. The SS values showing the percent improvement (degradation) in 
skill of the P-2 equations over of persistence, daily and monthly climatologies, 
flow regime probabilities, and the P-1 equations developed in Lambert and 
Wheeler (2005). These scores were calculated using the verification data for 

each month and for the entire warm season (All). 




Forecast Method 

May 

Jun 

Jul 

Aug 

Sep 

All 

Persistence 

28 

41 

37 

47 

41 

40 

Daily Climatology 

23 

25 

24 

24 

26 

25 

Monthly Climatology 

29 

27 

34 

30 

25 

29 

Flow Regime 

16 

12 

11 

18 

18 

15 

P-1 Equations 

0.2 

5 

19 

(-0.8) 

12 

8 
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Probability Distributions 

The P-1 and P-2 equation probability 
forecasts from the verification dataset were 
stratified by lightning and non-lightning days, and 
created a probability distribution for each. Such 
distributions show how well the equations 
distinguish between lightning and non-lightning 
days. Figure 6 shows the probability distributions 
for lightning days represented by the two red 
curves, and for non-lightning days represented by 
the two blue curves. For good performance, one 
would expect the blue curves to have a maximum 
in the lower probability values decreasing to a 
minimum at higher values, and the red curves to 
have a minimum in the lower probability values 
increasing to a maximum at the higher values. 



Figure 6. Forecast probability distributions for 
lightning (red) and non-lightning (blue) days in the 
verification data. The solid lines represent the P-2 
equations and the dashed lines represent the P-1 
equations. The y-axis values are the frequency of 
occurrence of each probability value, and the x- 
axis values are the forecast probability values 
output by the equations. 

Both blue curves for the non-lightning days in 
Figure 6 peak at 0.2 probability, decrease rapidly 
through 0.4, and then decrease more slowly 
toward 1. This indicates good performance for 
both equation sets. However, the P-1 equations 
distinguished non-lightning days with a bit more 
accuracy as evidenced by the higher peak of 59% 
versus 50% at 0.2, and the larger drop off to 21% 
versus 24% to 0.4. The percent occurrence for the 
P-1 equations remained ~1% below those of the 
new equations from 0.4 to 1 . 


The red curve for the P-2 equations indicates 
that they distinguished lightning days more 
accurately than the P-1 equations. The percent 
occurrences of the P-2 probabilities were lower 
than those for the P-1 equations for all probability 
values less than 0.7, and higher for all 
probabilities greater than 0.7. The peak percent 
occurrence for the P-2 equations was 36% at 0.8 
probability, while the peak for the P-1 equations 
was 30% at 0.6 probability. 

Reliability Diagram 

A reliability diagram (Wilks 2006) indicates 
how the equations perform in terms of under- or 
over-forecasting lightning occurrence at discrete 
probability values from 0 to 1 in increments of 0.1 . 
The equations do not output probability values in 
such discrete intervals. Therefore, the probability 
values were organized into bins according to their 
rounded value. The number of ‘yes’ lightning 
observations in each bin were divided by the total 
number of forecast/observation pairs in each bin 
to get a reliability value for that bin. For example, 
if there were 10 pairs assigned to the 0.1 bin and 
one of the observations was ‘yes’ for lightning, the 
reliability would be [1 ‘yes’ observation] / [10 
forecast/observation pairs] or 0.1. The forecast is 
said to exhibit perfect reliability when the reliability 
is equal to the bin value. 

The reliability diagrams for the P-1 and P-2 
equations are shown in Figure 7. The black 
diagonal line represents perfect reliability. Where 
the curves are below the black line, the equations 
over-forecasted lightning occurrence, and where 
the curves are above the line, the equations 
under-forecast lightning occurrence. Both curves 
are mostly above the black line, indicating a 
tendency to under-forecast lightning occurrence. 
The red curve for the P-2 equations was closer to 
the perfect reliability diagonal than the blue curve 
for the P-1 equations for all probabilities except for 
0.6 and 0.7. The number of samples used to 
calculate the reliability decreased with increasing 
forecast probability, which may account for the 
20% over-forecast at 0.9 forecast probability. The 
increasing variability of the P-2 equations at 
higher probabilities is likely an artifact of the 
relatively small sample sizes at those larger 
probabilities and that variation is likely not 
statistically significant. Overall, these curves 
demonstrate that the P-2 equations have better 
reliability than the P-1 equations. 
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Reliability Diagram for the New Equations 
May-September 1989-2005 
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Forecast Probability 

Figure 7. Reliability diagram of the P-1 and P-2 
probability forecasts for all months. The black 
diagonal line represents perfect reliability, the blue 
curve represents the reliability of the P-1 
equations, and the red curve represents the 
reliability of the P-2 equations. The histogram at 
the lower right shows the number of observations 
in each probability range for the old (blue) and 
new (red) forecast methods. 

The extent of the under-forecasting was 
quantified by calculating the bias for each 
equation set. The forecasts and observation pairs 
for the entire warm season were used in the bias 
calculation. The average bias in percent was 
calculated using the equation 

Z(Pi-°,) 

B = * 100, 

N 

where B is the bias in percent, Oi is the binary 
lightning observation, Pi is the associated 
probability forecast, and N is the number of 
observation/forecast pairs. The bias was -5.9% for 
the P-1 equations and -0.4% for the P-2 
equations. The P-2 equations reduced the under- 
forecast bias by 4.5%. 

Contingency Table Statistics 

Forecast verification using a contingency table 
is most appropriate for categorical forecasts in 
which a phenomenon is forecast to occur or not. It 
is a less appropriate method for probability 
forecasts that express levels of uncertainty in 
which no probability value in the range 0-1 is 
necessarily wrong or right (Wilks 2006). 
Nonetheless, it is a familiar and easily understood 
method that can shed light on forecast 
performance provided an appropriate probability 



threshold value is defined, above which the 
forecast will be considered ‘yes’ and below which 
the forecast will be considered ‘no’. 

The proper threshold, or cutoff, value depends 
on the forecast decision issue to which the user 
will apply the forecast (Wilks 2006). The original 
goal of Phase I was to create equations that 
performed better than persistence, since 
persistence performed better than NPTI. The goal 
of Phase II was to improve the performance of the 
equations further. The AMU used the condition for 
the P-1 and P-2 equations that the threshold value 
chosen must outperform the persistence forecast 
for all of the contingency table values. The 
procedure used tested probability values from 0.1 
to 0.9 in increments of 0.01. Details of the 
procedure are in Lambert (2007). The resulting 
threshold values were 0.35 for the P-1 equations 
and 0.47 for the P-2 equations. Table 5 contains 
the accuracy measures and skill scores for each 
set of equations using the cutoff values and those 
of their associated persistence forecasts. The P-2 
equations exhibited better accuracy and skill than 
persistence and the P-1 equations. 


Table 5. The accuracy measures and skill 
scores for the P-2 equations with a threshold 
cutoff probability of 0.47, the P-1 equations with 
a cutoff probability of 0.35, and the persistence 
(Pers) forecasts associated with each set of 
equations. 

Statistic 

P-2 

(0.47) 

Pers 

(P-2) 

P-1 

(0.35) 

Pers 

(P-1) 

POD 

0.68 

0.62 

0.66 

0.63 

FAR 

0.21 

0.23 

0.23 

0.24 

HR 

0.74 

0.71 

0.73 

0.71 

CSI 

0.52 

0.46 

0.50 

0.47 

HSS 

0.47 

0.40 

0.44 

0.40 

KSS 

0.47 

0.39 

0.44 

0.40 


Equation Performance Summary 


The main goal for this task was to create new 
lightning probability forecast equations that would 
outperform the P-1 equations currently used in 
operations. The new P-2 equations did outperform 
the P-1 equations as evidenced by the four tests. 


12 



The SS values indicated that the equations 
showed an increase in skill over daily and monthly 
lightning climatology, persistence, and the flow 
regime lightning probabilities. For the entire warm 
season, the P-2 equations showed an 8% 
increase in skill over the P-1 equations. The P-1 
equations showed a 48% improvement in skill 
over the NPTI in Phase I With the additional 8% 
gain in skill over the P-1 equations, the P-2 
equations showed a total gain in skill of 56% over 
the NPTI. The P-1 equations were slightly better 
at distinguishing non-lightning days, but the P-2 
equations were better at distinguishing lightning 
days (Figure 6). The P-2 equations demonstrated 
an improved reliability over the P-1 equations, and 
reduced the overall negative bias by almost 5% 
(Figure 7). Finally, the P-2 equations had the best 
accuracy measures and skill scores compared it 
their associated persistence forecasts and the P-1 
equations (Table 5). Since most of the tests 
indicated that the P-2 equations exhibited superior 
performance over the P-1 equations, they 
replaced the P-1 equations before the start of the 
2007 lightning season. 

6. GRAPHICAL USER INTERFACE 

In Phase I, the AMU created a GUI in Excel to 
facilitate user-friendly input to the equations and 
fast, easy-to-read output. The 45 WS was 
involved in the GUI development by providing 
comments and suggestions on the design to 
ensure the final product addressed their 
operational needs. With this GUI, the forecasters 
had to gather the predictor values from one 
system and enter them in the GUI on a separate 
computer. This step used time that could be spent 
doing other required duties, and increased the risk 
of entering an incorrect value resulting in an 
erroneous probability value. As part of Phase II, 
the AMU, assisted by Mr. Paul Wahner of 
Computer Sciences Raytheon (CSR), developed 
a similar GUI in MIDDS that gathers the required 
predictor values from the sounding automatically. 

The MIDDS GUI was developed with the Tool 
Command Language (Tcl)/Toolkit (Tk) capability 
in MIDDS. The design and function of this GUI is 
similar to the Excel GUI, but goes one step further 
by loading the sounding predictor values 
automatically into the dialog box. This removes 
the risk of a forecaster entering an incorrect value 
while also reducing the time the forecaster would 
spend gathering and calculating the required 


parameter values. The GUI has two dialog boxes: 
the first displays the date and asks for equation 
predictor values, and the second displays the 
equation output. 

Equation Predictor Dialog Box 

The equation predictor dialog box(Figure 8) is 
displayed first. The dialog box has five tabs, one 
for each month. The tab of the current month is 
displayed initially. The current month, day and 
sounding time are printed along the top of the 
dialog box. If the current day’s sounding is not 
available, “No Current Sounding” will be displayed 
in place of the date and time in the upper right. 
The day value can be changed by the up/down 
arrows or by entering a value manually in the text 
box. This allows forecasters flexibility when 
making the seven-day Weekly Planning Forecast. 
The sounding date and time is formatted by year, 
day of year, and UTC time. 



Figure 8. Equation predictor dialog box for June. 
A tab for each month is at the top, followed by the 
date and sounding time, then the predictor values. 
Clicking the “Dismiss” button closes the GUI, the 
“Reset Parameters” button resets the sounding 
stability parameters to original values, and the 
“Calculate Probability” button displays the 
probability output dialog box (Figure 9). 
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FUTURE WORK 


The user chooses Yes or No for persistence, 
and then a flow regime. They do not have to enter 
the sounding parameters since those values are 
already input by the GUI code and displayed in 
their associated text boxes. If there is not a 
current sounding, the text boxes will be populated 
with the values from the most recent sounding 
available. The “No Current Sounding” message in 
the top right corner will inform the forecaster that 
this is the case. 

The final step is to click on the “Calculate 
Probability” button in the lower right corner of the 
dialog box. The “Dismiss” button in the lower left 
closes the GUI. If the forecaster does not choose 
a persistence value or flow regime, one of two 
error messages is displayed informing the 
forecaster that a choice needs to be made (not 
shown) 


7. 

At the AMU Tasking Meeting in April 2007, a 
third phase to this task was approved in which two 
changes will be made in an effort to further 
improve equation performance. The first will be to 
include October data to the current POR. In 
looking at Figure 4a, it appears that the end of the 
lightning season is beyond September 30. The 
daily climatology values at the end of September 
are approximately 10% higher than at the 
beginning of May. The second modification will be 
to stratify the warm seasons by the progression of 
the daily climatology instead of by month. 
Stratifying the data by the progression of the daily 
climatology through the warm season is more 
meteorologically representative than by date. The 
AMU will develop a method to determine four to 
five sub-seasons in the overall warm season: 


Output 


When the user clicks the “Calculate 
Probability” button in the equation predictor dialog 
box, the probability of lightning occurrence for the 
day is displayed in a dialog box (Figure 9). The 
GUI code also outputs a file that contains all the 
parameter values input by the user to calculate 
the probability for later reference. 


El _ LU 


The probability of lightning being 
observed in at least one of the 
KSC/CCAFS advisory circles on 
Jun 8, from 0700 - 2400 EDT is: 


80 % 


OK 


Figure 9. The dialog box displaying the probability 
of lightning occurrence for the day as calculated 
by the equation. Clicking the “OK” button closes 
the box. 


• A pre-lightning season in early May, 

• A spin-up transition season from mid to 
late May to early or mid June, 

• The core lightning season from early to 
mid June to mid or late August, 

• A spin-down transition season through 
September, and 

• A possible post-lightning season in 
October. 

The number of sub-seasons cannot be 
determined until the October data are added and 
a new daily climatology chart is created. One 
equation will be developed for each sub-season 
and their performance compared to the P-2 
equations. 

Data from more years are planned to be 
added in future phases. This will help develop 
more robust statistical relationships in the 
equations and provide more data for verification. 
Also, new techniques may be available over the 
next few years that could help improve equation 
performance. Any such techniques should be 
considered and tested in future phases. 
Evaluation of equation performance should be 
done continuously to determine the tool’s 
strengths and weaknesses, which can be used to 
guide future modifications. 
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8. SUMMARY AND CONCLUSIONS 

The AMU created five logistic regression 
equations that predict the probability of cloud-to- 
ground lightning occurrence for the day in the 
KSC/CCAFS 5 n mi warning circles for each 
month in the warm season. These equations are 
based on equations developed in Phase 1 
(Lambert and Wheeler 2005), but with five 
modifications: 

• Increase the POR from 15 to 17 years 
(1989-2005), 

• Modify the valid area to eliminate areas 
that are not within the 5 n mi warning 
circles, 

• Include the XMR 1000 UTC sounding in 
determining the flow regime of the day, 

• Use a different smoothing function for the 
daily lightning climatology, and 

• Determine the optimal layer for the 
average RH calculation. 

The P-2 equations described in this report 
outperformed the P-1 equations by an overall 8%, 
and showed better performance than the P-1 
equations in four other tests. As a result, the new 
P-2 equations were added to the current set of 
tools used by the 45 WS to determine the 
probability of lightning for their daily planning 
forecast. The details of the Phase II work can be 
found in Lambert (2007). 

Results from the P-2 equations are meant to 
be used as first-guess guidance when developing 
the lightning probability forecast for the day. They 
provide an objective base from which forecasters 
can use other observations, model data, 
consultation with other forecasters, and their own 
experience to create the final lightning probability 
for the 1100 UTC briefing. 
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